2025-05-14-12-04
Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models
Abstract
arXiv:2505.07846v1 Announce Type: new Abstract: This study reveals how frontier Large Language Models LLMs can "game the system" when faced with impossible situations, a critical security and alignment concern. Using a novel textual simulation approach, we presented three leading LLMs (o1, o3-mini, and r1) with a tic-tac-toe scenario designed to be unwinnable through legitimate play, then analyzed their tendency to exploit loopholes rather than accept defeat. Our results are alarming for security researchers: the newer, reasoning-focused o3-mini model showed nearly twice the propensity to exploit system vulnerabilities (37.1%) compared to the older o1 model (17.5%). Most striking was the effect of prompting. Simply framing the task as requiring "creative" solutions caused gaming behaviors to skyrocket to 77.3% across all models. We identified four distinct exploitation strategies, from direct manipulation of game state to sophisticated modification of opponent behavior. These findings demonstrate that even without actual execution capabilities, LLMs can identify and propose sophisticated system exploits when incentivized, highlighting urgent challenges for AI alignment as models grow more capable of identifying and leveraging vulnerabilities in their operating environments.
摘要
本研究揭示了前沿大语言模型(LLMs)在面临不可能情境时如何"钻系统空子",这一发现对安全性和对齐性具有重要警示意义。通过创新的文本模拟方法,我们让三个领先的LLM模型(o1、o3-mini和r1)面对一个通过合法玩法无法获胜的井字棋场景,进而分析它们倾向于利用漏洞而非认输的行为。研究结果对安全研究人员发出警报:较新型、注重推理的o3-mini模型表现出近两倍于旧版o1模型(17.5%)的系统漏洞利用倾向(37.1%)。最显著的是提示词的影响——仅需将任务描述为需要"创造性"解决方案,所有模型的钻空行为就激增至77.3%。我们识别出四种不同的利用策略,从直接操纵游戏状态到复杂修改对手行为。这些发现表明,即使没有实际执行能力,当存在激励时,LLMs仍能识别并提出复杂的系统利用方案,这突显了随着模型识别和利用运行环境漏洞能力的提升,AI对齐问题面临的紧迫挑战。
Lost in Transmission: When and Why LLMs Fail to Reason Globally
Abstract
arXiv:2505.08140v1 Announce Type: new Abstract: Despite their many successes, transformer-based large language models (LLMs) continue to struggle with tasks that require complex reasoning over large parts of their input. We argue that these failures arise due to capacity limits on the accurate flow of information within LLMs. To formalize this issue, we introduce the bounded attention prefix oracle (BAPO) model, a new computational framework that models bandwidth constraints on attention heads, the mechanism for internal communication in LLMs. We show that several important reasoning problems like graph reachability require high communication bandwidth for BAPOs to solve; we call these problems BAPO-hard. Our experiments corroborate our theoretical predictions: GPT-4, Claude, and Gemini succeed on BAPO-easy tasks and fail even on relatively small BAPO-hard tasks. BAPOs also reveal another benefit of chain of thought (CoT): we prove that breaking down a task using CoT can turn any BAPO-hard problem into a BAPO-easy one. Our results offer principled explanations for key LLM failures and suggest directions for architectures and inference methods that mitigate bandwidth limits.
摘要
尽管取得了诸多成功,基于Transformer架构的大语言模型(LLM)在处理需要对其输入内容进行复杂推理的任务时仍存在困难。我们认为这些失败源于LLM内部信息流动准确性的容量限制。为系统阐述该问题,我们提出有界注意力前缀预言机(BAPO)模型——一种模拟注意力头部带宽约束的新计算框架(注意力机制是LLM内部通信的核心组件)。我们证明若干重要推理问题(如图可达性)需要BAPO具备高通信带宽才能解决,这类问题被定义为BAPO难题。实验验证了理论预测:GPT-4、Claude和Gemini能完成BAPO简易任务,但在相对小规模的BAPO难题上也会失败。BAPO还揭示了思维链(CoT)的另一优势:我们证明通过CoT分解任务可将任何BAPO难题转化为BAPO易解问题。这些发现为LLM关键失效模式提供了原理性解释,并为突破带宽限制的架构设计和推理方法指明了方向。
Patchwork: A Unified Framework for RAG Serving
Abstract
arXiv:2505.07833v1 Announce Type: new Abstract: Retrieval Augmented Generation (RAG) has emerged as a new paradigm for enhancing Large Language Model reliability through integration with external knowledge sources. However, efficient deployment of these systems presents significant technical challenges due to their inherently heterogeneous computational pipelines comprising LLMs, databases, and specialized processing components. We introduce Patchwork, a comprehensive end-to-end RAG serving framework designed to address these efficiency bottlenecks. Patchwork's architecture offers three key innovations: First, it provides a flexible specification interface enabling users to implement custom RAG pipelines. Secondly, it deploys these pipelines as distributed inference systems while optimizing for the unique scalability characteristics of individual RAG components. Third, Patchwork incorporates an online scheduling mechanism that continuously monitors request load and execution progress, dynamically minimizing SLO violations through strategic request prioritization and resource auto-scaling. Our experimental evaluation across four distinct RAG implementations demonstrates that Patchwork delivers substantial performance improvements over commercial alternatives, achieving throughput gains exceeding 48% while simultaneously reducing SLO violations by ~24%.
摘要
检索增强生成(RAG)作为一种通过整合外部知识源来提升大语言模型可靠性的新范式已经兴起。然而,由于这类系统本质上由大语言模型、数据库和专用处理组件构成的异构计算管道,其高效部署面临着重大技术挑战。我们提出了Patchwork——一个旨在解决这些效率瓶颈的端到端RAG服务框架。该架构具有三项关键创新:首先,它提供了灵活的规范接口,使用户能够实现自定义RAG流程;其次,它将流程部署为分布式推理系统,同时针对各RAG组件独特的可扩展性特征进行优化;第三,Patchwork整合了在线调度机制,持续监控请求负载与执行进度,通过策略性请求优先级调度和资源自动扩缩容,动态减少服务等级目标(SLO)违约。在四种不同RAG实现上的实验评估表明,Patchwork相较商业替代方案展现出显著性能提升,吞吐量增益超过48%,同时将SLO违约率降低约24%。
Benchmarking AI scientists in omics data-driven biological research
Abstract
arXiv:2505.08341v1 Announce Type: new Abstract: The rise of large language models and multi-agent systems has sparked growing interest in AI scientists capable of autonomous biological research. However, existing benchmarks either focus on reasoning without data or on data analysis with predefined statistical answers, lacking realistic, data-driven evaluation settings. Here, we introduce the Biological AI Scientist Benchmark (BaisBench), a benchmark designed to assess AI scientists' ability to generate biological discoveries through data analysis and reasoning with external knowledge. BaisBench comprises two tasks: cell type annotation on 31 expert-labeled single-cell datasets, and scientific discovery through answering 198 multiple-choice questions derived from the biological insights of 41 recent single-cell studies. Systematic experiments on state-of-the-art AI scientists and LLM agents showed that while promising, current models still substantially underperform human experts on both tasks. We hope BaisBench will fill this gap and serve as a foundation for advancing and evaluating AI models for scientific discovery. The benchmark can be found at: https://github.com/EperLuo/BaisBench.
摘要
大型语言模型与多智能体系统的兴起,引发了人们对能够自主开展生物学研究的人工智能科学家的日益关注。然而现有基准测试要么聚焦于无数据支持的推理任务,要么局限于提供预设统计答案的数据分析,缺乏真实数据驱动的评估场景。为此,我们提出生物AI科学家基准(BaisBench),该基准旨在评估AI科学家通过数据分析与外部知识推理生成生物学发现的能力。BaisBench包含两项任务:基于31个专家标注单细胞数据集的细胞类型注释,以及通过回答198道源自41项最新单细胞研究生物学见解的多选题来实现科学发现。针对前沿AI科学家与LLM智能体的系统实验表明,尽管当前模型展现出潜力,但在两项任务上的表现仍显著低于人类专家水平。我们希望BaisBench能够填补这一空白,并为科学发现AI模型的推进与评估奠定基础。基准测试地址:https://github.com/EperLuo/BaisBench。
Decoding Neighborhood Environments with Large Language Models
Abstract
arXiv:2505.08163v1 Announce Type: new Abstract: Neighborhood environments include physical and environmental conditions such as housing quality, roads, and sidewalks, which significantly influence human health and well-being. Traditional methods for assessing these environments, including field surveys and geographic information systems (GIS), are resource-intensive and challenging to evaluate neighborhood environments at scale. Although machine learning offers potential for automated analysis, the laborious process of labeling training data and the lack of accessible models hinder scalability. This study explores the feasibility of large language models (LLMs) such as ChatGPT and Gemini as tools for decoding neighborhood environments (e.g., sidewalk and powerline) at scale. We train a robust YOLOv11-based model, which achieves an average accuracy of 99.13% in detecting six environmental indicators, including streetlight, sidewalk, powerline, apartment, single-lane road, and multilane road. We then evaluate four LLMs, including ChatGPT, Gemini, Claude, and Grok, to assess their feasibility, robustness, and limitations in identifying these indicators, with a focus on the impact of prompting strategies and fine-tuning. We apply majority voting with the top three LLMs to achieve over 88% accuracy, which demonstrates LLMs could be a useful tool to decode the neighborhood environment without any training effort.
摘要
邻里环境包含住房质量、道路及人行道等物理与环境条件,这些因素对人类健康与福祉具有显著影响。传统评估方法(如实地调查和地理信息系统)需要大量资源,难以实现大规模环境评估。尽管机器学习为自动化分析提供了可能,但训练数据标注的繁琐过程及可用模型的缺乏阻碍了其扩展应用。本研究探讨了ChatGPT、Gemini等大语言模型作为大规模解码邻里环境(如人行道与电力线)工具的可行性。我们训练了基于YOLOv11的鲁棒模型,在检测街灯、人行道、电力线、公寓楼、单车道及多车道道路六类环境指标时达到99.13%的平均准确率。随后评估了ChatGPT、Gemini、Claude和Grok四款大语言模型在识别这些指标时的可行性、鲁棒性及局限性,重点分析了提示策略与微调的影响。通过采用前三名大语言模型的多数投票法,实现了超过88%的准确率,证明大语言模型无需训练即可成为解码邻里环境的有效工具。